ABSTRACT

Summary: Because an enormous amount of sequence data is being 
collected, a method to effectively display sequence variation information
is urgently needed. tasuke is a web application that visualizes
large-scale resequencing data generated by next-generation sequen- 
cing technologies and is suitable for rapid data release to the public 
on the web. The variation and read depths of multiple genomes, 
as well as annotations, can be shown simultaneously at various 
scales. We demonstrate the use of TASUKE by applying it to 50 rice 
and 100 human genome resequencing datasets. 
Availability and implementation: The tasuke program package and 
user manual are available from http://tasuke.dna.affrc.go.jp/. 
Contact: taitoh@affrc.go.jp 
1 INTRODUCTION

Recent advances in next-generation sequencing (NGS) technologies
have allowed the rapid production of a tremendous
amount of genomic sequence data at a low cost. This has 
naturally led to the resequencing of hundreds or thousands of 
genomes, such as the 1000 human genomes (http://www.1000genomes.org/)
and the 1001 genomes project in Arabidopsis
(http://www.1001genomes.org/). A method for comparing 
dozens of genomes in an effective manner is, therefore, urgently 
needed. Although a few stand-alone programs for comparative 
genome visualization have been developed (Fiume et al., 2010;
Preston et al., 2012; Thorvaldsdóttir et al., 2013), to our knowledge,
there is no web-based application that can handle dozens or more
resequencing datasets for large genomes from
higher eukaryotes.

The basic requirements of a visualization program for 
genome-wide resequencing data are as follows. First, a large 
amount of data obtained from tens or hundreds of samples 
from a species with a genome of >100 Mb needs to be displayed
smoothly. An overview of NGS read mapping results
needs to be shown so that users can grasp read coverage of a 
genome at the hundred- to million-base scale at a glance. Second, 
the use of storage and memory resources for the data browser 
should be minimal and small enough to be handled by the 
average computer server. It is not realistic to load an enormous
amount of mapped read data from individual samples one by
one with a stand-alone program on a PC. Therefore, a client-server
system, in which an efficient program runs on the server, is pre- 
ferred. Third, it should be possible to share the data with collab- 
orators or the public. In general, data from resequencing studies 
that are published as figures and tables in an article are not 
sufficient to reproduce a study's results, whereas raw short 
reads registered in sequence read archive databases and resultant 
single-nucleotide polymorphism (SNP) data are informative. For 
experimental researchers who seek polymorphisms at the 
genome-wide level, a browser that can effectively address hun- 
dreds of genomes and display a large number of polymorphic 
sites is needed. 

Here, we present tasuke, a web application for the visualiza- 
tion of large-scale resequencing data obtained from at least 
100 genomes. This application allows users to rapidly release 
their own data on the web. Variant frequencies, read coverage 
and gene annotation information are shown simultaneously at 
various scales. tasuke uses a window analysis so that users can
get a bird's-eye view of the SNP density. 
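
The window analysis mentioned above can be illustrated with a minimal Python sketch. The exact procedure used by tasuke is not described here, so the sketch assumes consecutive, non-overlapping windows of fixed width and 1-based SNP positions on a single chromosome. The function name snp_density and its parameters are hypothetical.

```python
def snp_density(positions, chrom_length, window_size):
    """Count SNPs in consecutive, non-overlapping windows.

    positions    -- iterable of 1-based SNP coordinates on one chromosome
    chrom_length -- length of the chromosome in bases
    window_size  -- window width in bases (assumed fixed)

    Returns a list with one SNP count per window, in genomic order.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    # Ceiling division: the last window may be shorter than window_size.
    n_windows = -(-chrom_length // window_size)
    counts = [0] * n_windows

    for pos in positions:
        if not 1 <= pos <= chrom_length:
            raise ValueError(f"position {pos} is outside 1..{chrom_length}")
        # Convert the 1-based position to a 0-based window index.
        counts[(pos - 1) // window_size] += 1

    return counts
```

For example, with SNPs at positions 1, 5, 10, 11 and 25 on a 25-bp sequence and a 10-bp window, the sketch returns counts of 3, 1 and 1 for the three windows.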

2 FUNCTIONS AND APPLICATION 

For ease of use, tasuke was designed as a web application
implemented in HTML5. In this way, researchers can
easily share data via general web browsers using a graphical 
interface. The input files required are as follows: a reference
genome in FASTA format, Variant Call Format (VCF) files
(Danecek et al., 2011) and depth files created by the 'depth'
command of SAMtools (Li et al., 2009). Annotation files in
General Feature Format (GFF, http://www.sanger.ac.uk/resources/software/gff/)
are optional. A MySQL database is also
required for the website's backend data management. tasuke
helps bioinformatics researchers in genome-wide resequencing
projects to visualize a large number of polymorphisms on multiple
genomes and to release the data to the public.
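
As a rough illustration of the two required per-sample inputs, the following Python sketch reads the tab-delimited text written by the SAMtools 'depth' command for a single alignment file (sequence name, 1-based position, depth) and the first five fixed columns of VCF data lines. It is not tasuke's own loader, which is not described here. The function names read_depth and read_vcf_sites are hypothetical, and the sketch assumes uncompressed text input.

```python
import io


def read_depth(handle):
    """Yield (chrom, pos, depth) from single-sample `samtools depth` output."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Columns: sequence name, 1-based position, read depth.
        chrom, pos, depth = line.split("\t")[:3]
        yield chrom, int(pos), int(depth)


def read_vcf_sites(handle):
    """Yield (chrom, pos, ref, alts) from VCF data lines, skipping headers."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        # Fixed VCF columns: CHROM, POS, ID, REF, ALT, ...
        chrom, pos, _id, ref, alt = fields[:5]
        yield chrom, int(pos), ref, alt.split(",")


if __name__ == "__main__":
    # Tiny in-memory examples standing in for real files.
    depth_text = "chr01\t1\t12\nchr01\t2\t15\n"
    vcf_text = (
        "##fileformat=VCFv4.1\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
        "chr01\t2\t.\tA\tG\t50\tPASS\t.\n"
    )
    print(list(read_depth(io.StringIO(depth_text))))
    print(list(read_vcf_sites(io.StringIO(vcf_text))))
```

Keeping depth and variants as separate streams mirrors the input design described above: the VCF file lists only variant sites, whereas the depth file records how well each position is covered.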

Fig. 1. Screenshots of tasuke. (A) A view showing variants at the 500 bp/
block scale. (a) Menu bar for various functions. (b) Chromosomal positions.
(c) Annotation tracks. (d) Sample names and related information.
(e) Main panel for variant frequencies of block regions. (f) Magnified
view of blocks. Blocks without reads are yellow. (g) Indicator for variant
frequency and depth. (h) Overall SNP density. (B) A view showing variants
and depth at the 1 bp/block scale. (i) Amino acids and nucleotides on a
reference genome. (j) Variants and their effects. (k) Indicator of levels
of variant effects. (l) Variants and average depth information are shown
by clicking on a block. (m) Variant effects are shown by clicking on the
sub-window of (l).

On the upper pane, the reference genome and annotation information
are displayed (Fig. 1b and c). Users can choose a
specific position by clicking on the selected chromosome or
moving a slider in the upper right region. Alternatively, the top
menu bar provides users with a search function to find identifiers
or genomic positions (Fig. 1a). Nucleotide variations (SNPs and
length polymorphisms) and depth of mapped reads are presented
in the lower main pane, which can be dragged to the left or right
(Fig. 1e). The depth information is important to distinguish
whether a region has no SNPs or no mapped reads, because both
cases are generally absent from VCF files. The frequency of variation
occurrence and/or average depth are shown in a block that corresponds
to a region scalable from 1 bp to 100 kb with colored
gradations: blue for SNPs, red for insertions/deletions and gray
or yellow for depth (Fig. 1f). The maximum number of blocks
displayed in a window is 200, so that up to 20 Mb can be viewed.
At the most precise level, individual nucleotides and translated
amino acids can be shown (Fig. 1B). By clicking on a block, a
window of detailed information about nucleotide variations and
depth pops up (Fig. 1l). To find mutations that possibly affect
phenotypes, the effect information of each variant, such as non-synonymous
changes and frame shifts, which can be added to a
VCF file by snpEff (Cingolani et al., 2012), is shown by selecting
'snpEFF' in the menu bar (Fig. 1l and m). If a sample name is
clicked, the reference genome is reset to the selected sample and
variant frequencies are recalculated for all genomes. This reference
switch function is useful to look over variations derived
from different origins. From the 'Tools' menu (Fig. 1a), users
can export variant information for a specified region of up to
200 kb as a tab-delimited file. An image
file of the displayed area is also downloadable.
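
The block display described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not tasuke's implementation: positions missing from the depth table are treated as depth 0, the variant frequency of a block is taken as variants per base, and the names summarize_blocks, MAX_BLOCKS and MAX_BLOCK_SIZE are hypothetical. The sketch shows how depth separates blocks with no mapped reads from covered blocks with no variants, and it restates the upper limit of 200 blocks of 100 kb, that is, 20 Mb per window.

```python
MAX_BLOCKS = 200          # maximum number of blocks shown in one window
MAX_BLOCK_SIZE = 100_000  # largest block scale (100 kb)


def summarize_blocks(start, block_size, n_blocks, depth_by_pos, variant_positions):
    """Summarize depth and variants for consecutive blocks.

    start             -- 1-based coordinate of the first block
    block_size        -- bases per block (1 to MAX_BLOCK_SIZE)
    n_blocks          -- number of blocks (at most MAX_BLOCKS)
    depth_by_pos      -- dict mapping position -> read depth
                         (assumption: missing positions have depth 0)
    variant_positions -- iterable of variant coordinates for one sample
    """
    if not 1 <= block_size <= MAX_BLOCK_SIZE:
        raise ValueError("block_size out of range")
    if not 1 <= n_blocks <= MAX_BLOCKS:
        raise ValueError("n_blocks out of range")

    end = start + block_size * n_blocks  # exclusive upper bound
    depth_sum = [0] * n_blocks
    n_variants = [0] * n_blocks

    # Bin depth values and variants into their blocks.
    for pos, depth in depth_by_pos.items():
        if start <= pos < end:
            depth_sum[(pos - start) // block_size] += depth
    for pos in variant_positions:
        if start <= pos < end:
            n_variants[(pos - start) // block_size] += 1

    blocks = []
    for i in range(n_blocks):
        if depth_sum[i] == 0:
            state = "no_reads"      # nothing mapped: variants cannot be called
        elif n_variants[i] == 0:
            state = "no_variants"   # covered, and identical to the reference
        else:
            state = "variants"
        blocks.append({
            "start": start + i * block_size,
            "mean_depth": depth_sum[i] / block_size,
            "variants": n_variants[i],
            "variant_frequency": n_variants[i] / block_size,
            "state": state,
        })
    return blocks
```

With a depth of 10 at positions 1 to 5, a depth of 8 at positions 6 to 10, no coverage at positions 11 to 15 and a single variant at position 3, three 5-bp blocks starting at position 1 are labeled "variants", "no_variants" and "no_reads", respectively.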



As a demonstration, we applied tasuke to resequencing data 
from rice and human samples so that users can experience the 
functions of tasuke. First, we used resequencing data from 
50 rice genomes at ~15x coverage (Xu et al., 2011), which was
downloaded from the DDBJ Sequence Read Archive (Kodama
et al., 2010). The short reads of rice were mapped to the reference
genome (Sakai et al., 2013) by BWA (Li and Durbin, 2009).
Second, alignments of human genome resequencing data
generated by the 1000 Genomes Project were downloaded
(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/), and 100 individuals
who represent subpopulations were arbitrarily selected.
Variant and read depth information were obtained from both
datasets by using SAMtools. The annotations from the Rice
Annotation Project Database (Sakai et al., 2013) and Ensembl
(Flicek et al., 2013) were also stored in MySQL databases. These
datasets are accessible through tasuke at http://tasuke.dna.affrc.go.jp/.



3 CONCLUSION 

tasuke is designed for the visualization and rapid release of 
large-scale resequencing data on the web. This application 
allows users to see variant frequencies, read depth and annota- 
tion information in a scalable and smooth manner. We demon- 
strated its functionality through application to resequencing data 
from the rice and human genomes. This application is useful
for the analysis of other genome-wide NGS data obtained
from large numbers of samples. In the future, to cope with growing
resequencing data as well as RNA-seq and other NGS data, we will further
improve tasuke.